Hierarchical XML Layers Representation for Heavily Annotated Corpora

Dan Cristea and Cristina Butnariu

University “Al. I. Cuza” of Iaşi, Faculty of Computer Science
and
Institute for Theoretical Computer Science, Romanian Academy – the Iaşi Branch
{dcristea,cris}@infoiasi.ro

Abstract

The paper proposes a scheme for hierarchical representation of XML annotation standards. The representation allows individual work on documents displaying partial fitness in markings, mixing of annotated documents observing or not the same standard, as well as concurrent annotation. The approach allows access to different annotations of a corpus, with minimal representation overhead, which also facilitates accommodation of different, even incompatible, annotations of the same data. Two methods to build a hierarchical representation of annotation standards are shown, one allowing explicit declarations and the other inferring the hierarchy from a set of consistently annotated documents. Merging and extraction operations, which produce derived documents from existing ones, are described. A system that implements the formal declarations of the hierarchy and the operations over it is presented.

1. Introduction

The deeper the linguistic research, the more sophisticated the annotation required. Recently, since XML has become a de facto standard for the representation of annotated corpus resources (Ide, Bonhomme and Romary, 2000), the sophistication of the types of processing over texts, speech or multi-media documents has resulted in the production of over-crowded marked documents. Annotation in corpora is used not only to record experts' views on specific linguistic phenomena, but also to store intermediate results in pipeline NLP architectures and to post NLP results on the Web (Cunningham et al., 2002). But not all layers of annotation are useful always and for each step in a processing chain. Usually an NLP step uses as input a document conforming to a certain annotation standard, to which it adds another layer of annotation. Also, a human expert uses a tool to annotate a certain document. During the annotation process the expert can make use of some previous annotation layers that, through an interactive tool, can help the task at hand. In all cases, there are reasons to consider that, for a specific task (automatic processing or manual annotation), some existing markings in the input document are useful while others are not and, therefore, could be obscured.

Examples of corpora use of this kind are corpus annotation in teamwork and re-usage … could require individual experts or software modules, … different layers … the same original document. Individual results should be merged together in an attempt to … tasks that employ existing corpora, to which supplementary annotation layers are added. Heavily annotated corpora obtained in these ways could then be used to draw inter-layer correlations.

The paper reconsiders and enhances a hierarchical scheme to represent annotation standards, proposed in (Cristea et al., 1998), to which a processing machinery is added. Annotation standards are represented in a hierarchy, which enables multiple views over a document representation. Navigation within the hierarchy observes inheritance criteria. The approach allows access to different annotations of a corpus, with minimal representation overhead, which also facilitates accommodation of different (and sometimes incompatible) annotations of the same data.

The approach prefers a standoff encoding scheme to an embedded one (Thompson and McKelvie, 1997). Potentially, the original hub (empty annotation) document resides at a URL that could be different from the one at which the annotation is added. Then, any annotation brackets around a piece of text can be recorded separately from the flesh data, through their beginning and end character offsets in the original text. As such, the hub string, identical in all documents, serves as the absolute system of reference.

We show how relations between different markings can be described in the hierarchy and how the directed acyclic graph representation can accommodate circular dependencies between annotation standards. Two methods to build such a graph are shown, one allowing explicit declarations and the other inferring the hierarchy from a set of consistently annotated documents. The case of concurrent annotations over the same hub document is discussed and a solution for contradictory (overlapping) representations is proposed. Finally, we introduce a set of operations that simplify an existing annotated document and combine two different annotations over the same hub document into a unique one.

2. The Hierarchy – a Lattice Representation

In our approach, different layers of annotation over a corpus are codified as a hierarchy of annotation standards (a directed acyclic graph, or DAG). A node in the hierarchy is described according to the following syntax:

    …

A standard (node) name is a unique symbol in the hierarchy. A standard inherits all features of all its parents. To avoid conflicts, in the present implementation no preference inheritance criteria are given, which means that the features belonging to the parents of a node are supposed to be orthogonal. Features which are new to a standard, with respect to those inherited, are defined between <standard> … </standard> brackets by any number of <tag> and <ref> labels. A <tag> label records a new XML element tag. It has a name (label) and a list of attributes. A <ref> label records a semantic relation (dependency) between two annotation standards. It describes a reference between an attribute, called source-attribute, belonging to an XML tag, called source-tag, of the current standard (the one that contains the ref description) and another attribute, called target-attribute, belonging to another XML tag, called target-tag, of a superior standard [1]. A standard A is superior to a standard B if and only if there is a path from B to the root of the hierarchy that passes through A.

[1] There is no a priori motivation for calling one attribute source and the other target, apart from the fact that, usually, the target attribute is the id attribute of the target tag. Moreover, all target attributes belong to nodes placed higher in the hierarchy.

We say that a node A subsumes a node B in the hierarchy (therefore B is a descendant of A) if and only if:
- any tag-name of A is also in B;
- any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B;
- any semantic relation which holds in A also holds in B;
- either B has at least one tag-name which is not in A, and/or there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the same tag-name in A, and/or there is at least one semantic relation which holds in B and which does not hold in A.

As such, a hierarchical relation between a node A and one … constraints.

Figure 1 displays an example of a declaration of a hierarchy of linguistic annotations. The definition builds a lattice, like that in Figure 2, which intends to describe applications. ST-ROOT represents the “empty” annotation (no tags), therefore describing the hub document of free text. Immediately under this trivial standard, three standards, ST-TOK, ST-SEG and ST-PAR, are placed. ST-TOK is intended to identify tokens, such as words and punctuation, and to mark the lemmas of words; ST-SEG marks borders between elementary discourse units (edus), as in (Marcu, 2000); and ST-PAR simply marks paragraphs. ST-POS is placed under ST-TOK. This standard does not …; it adds the part-of-speech information through its attribute pos. The standard ST-POS is a parent of both ST-NP and ST-VP, which are supposed to mark noun phrases (NPs) and verb phrases (VPs), respectively. Tags of these kinds also indicate the heads of the corresponding compounds, as ids of the TOK tags corresponding to the headwords. The ref definitions specify that the head-id attribute of the NP and VP tags should be filled with values of the id attribute of the TOK tags. Then, ST-COREF, placed under ST-NP, is a standard intended to mark anaphoric links between co-referential NPs. It supplements the NP tag with a coref attribute. The ref definition expresses the constraint that the coref attribute of an anaphoric NP indicates the id attribute of the antecedent NP. ST-SEG-NP-VP is a standard of an annotation which marks simultaneously noun phrases, verb phrases and discourse unit boundaries. It adds no new markings to those inherited from its three parents. Finally, ST-COREF-IN-SEG is a standard in which the coreferences and segment boundaries are marked, while ST-PAR-SEG-NP-VP adds the paragraph layer annotation to the markings for NPs, VPs and edus.

Figure 1: Declarations of a hierarchy of annotation standards

Figure 2 (lattice diagram; nodes: ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-COREF-IN-SEG, ST-PAR-SEG-NP-VP)

Figure 3: Example of annotation (a fragment of Orwell's novel “Nineteen Eighty-Four” annotated conforming to the ST-COREF-IN-SEG standard)

3. Representing Circular References

From the above subsumption definition it follows that a standard B that includes references to tags belonging to another standard A should be placed under A in the hierarchy. But what if A refers to B as well? Imagine, for instance, an annotation standard in which we have VPs (verb phrases) and SEGs (elementary discourse units) and we want to record for each VP the unit it belongs to, and in each SEG the head VP. Following the above observation, this would raise circularities, which are not acceptable in a DAG structure.

Suppose ST-SEG and ST-VP are standards of markings that contain the SEG tags and the VP tags, respectively, and neither of the two includes references to the other. A standard ST-SEG-TO-VP, placed under both ST-SEG and ST-VP, is one in which each SEG tag records the head VP. Similarly, a standard ST-VP-TO-SEG, also placed under ST-SEG and ST-VP, will enrich the VP tags with a reference to the surrounding SEG. Finally, a standard ST-SEG-VT, child of both ST-SEG-TO-VP and ST-VP-TO-SEG, inherits from its parents without adding anything else. The result is a hierarchy like that in Figure 4a, whose corresponding description is given in Figure 5a. However, if the intermediate standards ST-SEG-TO-VP and ST-VP-TO-SEG are not useful by themselves, they can be deleted without any loss, such that only the final ST-SEG-VT is kept, as a child of both ST-SEG and ST-VP, as in Figure 4b, with the description given in Figure 5b. The circular-like constraints appear in the two ref declarations of the ST-SEG-VT standard.

It follows that the representation of XML standards that we propose is not in contradiction with constraints that have circular patterns.

Figure 4: Variants of hierarchical representations: without (a) and with (b) circular patterns of ref constraints

Figure 5: Declarations of the hierarchies in Figure 4 (a and b)

4. Automatic Classification

In order to interact with an existing hierarchy, one should be able to automatically place a new document within it. Two things are important here: compatibility of names and detection of semantic relations.

The first problem deals with name-spaces: in order for a document to be compared against a hierarchy, it should be compatible with the tag and attribute names populating the hierarchy. If the annotations in the new document are semantically identical with those in the hierarchy but there are name mismatches, compatibility can be achieved by a translation mechanism. More complex compatibility adjustments can be obtained by working on values, for instance by exploding a range of values of an attribute into new attribute-value pairs [2].

[2] The morpho-syntactic descriptions, for example, use complex attribute-value pairs (such as msd=”Ncmso”), which can be expanded into a set of elementary features (pos=”noun” type=”common” gender=”masculine” number=”singular” case=”oblique”).

The semantic-relations problem deals with recognizing the intersection of domain values. The identity of the values of two attributes, or their intersection, cannot be certified otherwise than by explicit declaration. Automatic detection is always prone to errors, which can be generated by fortuitous value fitness.

In our system, the classification module takes a hierarchy and an XML document and classifies the document within the hierarchy. The header of the document should declare the list of the semantic relations as a collection of <ref> records, enclosed in a pair of <semantic-relations> … </semantic-relations> brackets, each having the same syntax as in a hierarchy declaration:

    …

The classification process proceeds in two steps. First, the document to be classified is parsed and a collection of <tag> and <ref> declarations, having the same syntax as in the hierarchy declaration, is compiled. The <tag> records are computed by collecting all tags and their corresponding attributes from the XML elements, and the <ref> records by simply reading the declarations in the <semantic-relations> header. Let us call this computed collection of <tag> and <ref> records the witness collection. Also, let us call the proper and inherited features of a node the node collection.

The witness collection is matched against the node collections of the hierarchy, top-down, starting at the root node. The classification of the witness collection down the hierarchy generally follows the programming-by-classification paradigm (Mellish and Reiter, 1993). We say that the witness collection satisfies the restrictions of a node collection of the hierarchy (or is classified under that node) if the features of the node collection represent a subset of the features of the witness collection, that is, if all (name, attributes) pairs of the <tag> declarations of the node, and all (source-tag, source-attribute, target-tag, target-attribute) quadruples of the <ref> declarations of the node, proper and inherited, are part of the witness collection as well. In this way the witness collection “falls” down the hierarchy, reaching certain levels, possibly more than just one. Below those levels, the witness collection cannot be classified under any of the nodes found there. The set of all lowest nodes the witness collection is classified under forms a superior borderline. In order to complete the process, an inferior borderline must also be determined. Two cases are possible:

a) there is a set of nodes in the hierarchy which all have as parents the set of nodes on the superior borderline, and only these nodes, and all the corresponding node collections satisfy the witness collection. Then the inferior borderline is given by the set of these nodes, and a new node should be included between the superior borderline and the inferior borderline (see Figure 6a).

b) either there is a set of nodes in the hierarchy which have as parents the set of nodes on the superior borderline, and only these nodes, but none of these node collections satisfies the witness collection, or no common child of the superior borderline can be found. Then the inferior borderline is not defined in the hierarchy, and a leaf node is created and placed as a child of the nodes belonging to the superior borderline (see Figure 6b).

When the classification is completed, the search should continue beyond the nodes placed above the … nodes are found, they should be added to the list of parents of the classified node.

Figure 6: Examples of final classification (panels a and b, each marking the superior borderline and the inferior borderline). Legend: nodes the witness collection classifies under; nodes the witness collection does not classify under; the new node.

5. Concurrent Annotations

Two annotations are called concurrent if they intend to represent the same linguistic phenomenon from different perspectives, therefore possibly resulting in different solutions. Below we give two examples where concurrent annotations arise.

a) Same standard, different target documents. Often, in order to validate an annotated corpus, different teams receive the same task and their work is compared. In case of agreement, the common solution is adopted with high trust. In case of mismatch, either the controversial versions are given to a third judge, who is asked to decide in favour of one of the two solutions, or the subjects are persuaded to negotiate an agreement. In these cases one would like to compile a unique document that keeps the common annotations, while also indicating the concurrent parts and the corresponding individual markings. In annotation tasks of this kind it is likely that the agreed parts are significantly larger than the concurrent parts.

b) Different standards and documents. Suppose a corpus has to be syntactically annotated with respect to two distinct linguistic theories. In this case, two standards have to be considered. It is not impossible to imagine a certain research task, for instance comparing a phrase structure and a dependency structure, attempting to check whether a certain clausal constituent is in a given constituency relation within the sentence and in a certain functional dependency with respect to the main verb. It is likely that a pair of two such documents have no identical parts. However, it is also likely that a granularity border exists, above which the two documents have the same structure and below which the solutions are different. Then, in order to demonstrate the two approaches over the same text, a common layer of annotation should indicate at least these granularity limits (sentence, clause, etc.).

Viewed from the perspective of the hierarchical graph representation, documents conforming to concurrent annotations cannot be combined in a common standard, at least for the reason that the target XML documents would contain crossing markings. However, supposing a unique document is preferable to two different versions (for the reasons given above), one should allow for both common and concurrent annotations in the same document. Solutions to accommodate concurrent annotations have been previously proposed (Witt, 2002; Sperberg-McQueen and Huitfeldt, 1999). In our system we represent concurrency in annotation as follows:

    text1 … text2
    text3 … text4

where text1+text2=text3+text4.

6. Operations within the hierarchy

Following the discussion above, our system implements a set of operations, as described in this section.

The initialize-hierarchy operation takes a document, headed by a semantic-relations statement, and builds a trivial hierarchy formed by the ROOT node (the empty annotation) and one standard corresponding to the annotation in the document.

The classify operation takes an existing hierarchy and a document, headed by a semantic-relations statement, and classifies the document with respect to the hierarchy, as described in section 4. It ends either by naming an existing standard in the hierarchy which the document fully observes, or by placing a new standard in a certain place within the hierarchy. As such, a hierarchy can be built in two ways: ad hoc, by manually declaring it, when there is sufficient a priori knowledge over a full range of corpus annotations, already existing or to be created, as shown in section 2; or corpus-driven, by an initialize-hierarchy command followed by any number of classify commands, when a range of annotated documents is used to seed a hierarchy. Note that in this case the annotated documents from which the hierarchy is built need not all be replicas of the same hub document. When annotation conventions are consistent within the collection of documents, different hub documents can be used to incrementally build a hierarchy of annotation standards.

Given a graph of annotation standards and documents annotated according to these standards, all having the same hub document, merge and extract operations can be defined. A merge combines two documents having identical hubs and corresponding to two distinct nodes of the hierarchy which are not in a subsumption relation, and produces, on the one hand, another node in the hierarchy, descendant of the two input nodes, and, on the other, the corresponding document, which contains the union of the annotation tags of the two originating documents. An extract applies the reverse operation, extracting from a document corresponding to a certain node a document conforming to one of the node's ancestors in the hierarchy.

Finally, there are two types of concurrent-checks. One receives a standard name and two XML files, annotated versions of the same hub document, both supposed to observe the standard, and produces a file in which the annotation differences are highlighted, as described in section 5, example a. The second receives two standard names and two files corresponding to these standards, and produces a difference file, as described in section 5, example b.

7. Conclusions

We described a data structure and a system aimed at facilitating the definition and exploitation of annotation standards over corpora. The system, which interprets the hierarchy definition declarations and implements the described operations, has been built in Java. It is freely available and can be downloaded from the address: http://consilr.info.uaic.ro/~pic/lc

As further developments, we intend to supplement the described operations with others, which will finally configure a complex environment, provided with a graphical interface, for working with annotated corpora. This environment could include, for instance, visualization of the hierarchy and interactive operations over it, including the deletion of nodes under some restrictions, unification of two hierarchies, cutting of a sub-hierarchy from an existing one, etc. To unify different annotations with identical or close semantics, we also intend to complement the tag and attribute names with a declarative semantic description. The final goal is to provide automatic conversion from one annotation name space to another, when the associated tags are semantically equivalent. This will aim at keeping strict control over annotation standards, avoiding the proliferation of tag and attribute names.

We have acquired a collection of corpora, all based on George Orwell's novel “Nineteen Eighty-Four” as hub document, in both English and Romanian, on which the program was tested. In particular, a discourse parsing application, at present under development, makes heavy use of the merging operations on a rich hierarchy of standards. Also, all resources used for the development of the Romanian WordNet, under the Balkanet and Balkanet-MEC projects (Tufiş, Cristea and Stamou, 2004), have been classified in a unique hierarchy with the described system.

The described system can help efforts oriented towards the standardisation of language resources. To give an example, we intend to describe all resources which have been created or will be created for the Romanian language, and which are deposited on the site of the Consortium for the Romanian Language Technology, in an NLP-dedicated unique hierarchy. Using this hierarchy, each document will be assigned to a node, whose corresponding standard it observes.

Acknowledgements

The research presented in this paper has been partly supported by the EC funded IST-2000-29388 Balkanet project, and the Balkanet-MEC project funded by the Romanian Ministry of Education and Research.

References

Cristea, D., Ide, N. and Romary, L. (1998). Marking-up multiple views of a Text: Discourse and Reference, in Proceedings of the First International Conference on Language Resources and Evaluation, Granada.

Cunningham, H., Maynard, D., Bontcheva, K. and Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications, in Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics.

Ide, N., Bonhomme, P. and Romary, L. (2000). XCES: An XML-based Standard for Linguistic Corpora, in Proceedings of the Second Language Resources and Evaluation Conference (LREC), Athens, Greece.

Marcu, D. (2000). The Theory and Practice of Discourse Parsing and Summarization. The MIT Press.

Mellish, C. and Reiter, E. (1993). Using Classification as a Programming Language, in Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-1993), Volume 1, pages 696-701. Morgan Kaufmann.

Sperberg-McQueen, C. M. and Huitfeldt, C. (1999). GODDAG: A Data Structure for Overlapping Hierarchies, in Principles of Digital Document Processing, Munich, Berlin, Springer Verlag. Early draft presented at the ACH-ALLC Conference in Charlottesville.

Thompson, H. and McKelvie, D. (1997). Hyperlink semantics for standoff markup of read-only documents, in Proceedings of SGML Europe'97, Barcelona.

Tufiş, D., Cristea, D. and Stamou, S. (2004). BalkaNet: Aims, Methods, Results and Perspectives. A General Overview, to appear in Romanian Journal of Information Science and Technology, Romanian Academy, Bucharest.

Witt, A. (2002). Meaning and interpretation of concurrent markup, in Proceedings of ALLC/ACH 2002, Joint Conference of the ALLC and ACH, Tübingen.
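
Illustrative sketch: hierarchy and subsumption (section 2)

The subsumption relation of section 2 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java system. The class and method names (Standard, Hierarchy, collection, subsumes) are hypothetical, and the attribute lists given to each standard are assumptions read off the prose description of Figure 1, not the original declarations.

```python
from dataclasses import dataclass, field


@dataclass
class Standard:
    name: str
    parents: list
    tags: dict = field(default_factory=dict)  # tag name -> set of attribute names
    refs: set = field(default_factory=set)    # (source-tag, source-attr, target-tag, target-attr)


class Hierarchy:
    """A DAG of annotation standards (hypothetical helper, not the authors' API)."""

    def __init__(self):
        self.nodes = {}

    def add(self, name, parents=(), tags=None, refs=()):
        # Parents must already exist, so no cycle can ever be introduced.
        for p in parents:
            if p not in self.nodes:
                raise KeyError(f"unknown parent standard: {p}")
        own_tags = {t: set(a) for t, a in (tags or {}).items()}
        self.nodes[name] = Standard(name, list(parents), own_tags, set(refs))

    def collection(self, name):
        """Proper plus inherited features of a node (its 'node collection')."""
        node = self.nodes[name]
        tags = {t: set(a) for t, a in node.tags.items()}
        refs = set(node.refs)
        for p in node.parents:
            p_tags, p_refs = self.collection(p)
            for t, attrs in p_tags.items():
                tags.setdefault(t, set()).update(attrs)
            refs |= p_refs
        return tags, refs

    def subsumes(self, a, b):
        """True if every feature of a is in b and b has at least one more."""
        a_tags, a_refs = self.collection(a)
        b_tags, b_refs = self.collection(b)
        contained = all(
            t in b_tags and attrs <= b_tags[t] for t, attrs in a_tags.items()
        ) and a_refs <= b_refs
        strictly_more = a_tags != b_tags or a_refs != b_refs
        return contained and strictly_more


# Part of the lattice described in section 2 (attribute lists are assumptions).
h = Hierarchy()
h.add("ST-ROOT")
h.add("ST-TOK", ["ST-ROOT"], {"TOK": {"id", "lemma"}})
h.add("ST-POS", ["ST-TOK"], {"TOK": {"pos"}})
h.add("ST-NP", ["ST-POS"], {"NP": {"id", "head-id"}}, {("NP", "head-id", "TOK", "id")})
h.add("ST-VP", ["ST-POS"], {"VP": {"id", "head-id"}}, {("VP", "head-id", "TOK", "id")})
h.add("ST-COREF", ["ST-NP"], {"NP": {"coref"}}, {("NP", "coref", "NP", "id")})

print(h.subsumes("ST-TOK", "ST-COREF"))  # ST-COREF carries every ST-TOK feature and more
print(h.subsumes("ST-NP", "ST-VP"))      # siblings: neither subsumes the other
```

Requiring that parents be declared before their children is one simple way to keep the graph acyclic; the original system may enforce this differently.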
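
Illustrative sketch: witness collection and superior borderline (section 4)

The first part of the classification procedure of section 4 (compiling the witness collection and letting it "fall" to the superior borderline) might be sketched as follows. This is an assumption-laden illustration in Python, not the authors' implementation: the wrapper element name doc, the HIERARCHY table and all function names are hypothetical, and the attribute lists are guesses based on the prose. Determining the inferior borderline and inserting the new node are left out.

```python
import xml.etree.ElementTree as ET

# name -> (parents, own tags {tag: attributes}, own refs); contents are assumptions
HIERARCHY = {
    "ST-ROOT": ([], {}, set()),
    "ST-TOK": (["ST-ROOT"], {"TOK": {"id", "lemma"}}, set()),
    "ST-SEG": (["ST-ROOT"], {"SEG": {"id"}}, set()),
    "ST-POS": (["ST-TOK"], {"TOK": {"pos"}}, set()),
    "ST-NP": (["ST-POS"], {"NP": {"id", "head-id"}}, {("NP", "head-id", "TOK", "id")}),
}


def node_collection(name):
    """Proper and inherited features of a standard."""
    parents, own_tags, own_refs = HIERARCHY[name]
    tags = {t: set(a) for t, a in own_tags.items()}
    refs = set(own_refs)
    for p in parents:
        p_tags, p_refs = node_collection(p)
        for t, attrs in p_tags.items():
            tags.setdefault(t, set()).update(attrs)
        refs |= p_refs
    return tags, refs


def witness_collection(xml_text):
    """Collect tag/attribute pairs from the body and ref records from the header."""
    tags, refs = {}, set()
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "ref":
            refs.add((el.get("source-tag"), el.get("source-attribute"),
                      el.get("target-tag"), el.get("target-attribute")))
        elif el.tag not in ("doc", "semantic-relations"):
            tags.setdefault(el.tag, set()).update(el.attrib)
    return tags, refs


def satisfies(witness, name):
    """The node collection must be a subset of the witness collection."""
    w_tags, w_refs = witness
    n_tags, n_refs = node_collection(name)
    return all(t in w_tags and a <= w_tags[t] for t, a in n_tags.items()) and n_refs <= w_refs


def superior_borderline(witness):
    """Lowest nodes the witness is classified under."""
    classified = {n for n in HIERARCHY if satisfies(witness, n)}
    return {n for n in classified
            if not any(n in HIERARCHY[c][0] for c in classified)}


DOC_A = """<doc>
<semantic-relations>
<ref source-tag="NP" source-attribute="head-id" target-tag="TOK" target-attribute="id"/>
</semantic-relations>
<NP id="n1" head-id="t2"><TOK id="t1" lemma="his" pos="PRON">his</TOK>
<TOK id="t2" lemma="mother" pos="NOUN">mother</TOK></NP>
</doc>"""

DOC_B = '<doc><SEG id="s1"><TOK id="t1" lemma="she">She</TOK></SEG></doc>'

print(superior_borderline(witness_collection(DOC_A)))
print(superior_borderline(witness_collection(DOC_B)))
```

The second document reaches two nodes at once (ST-TOK and ST-SEG) that have no common child in this small table, which corresponds to case b of section 4: a new leaf would be created under both.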
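
Illustrative sketch: merge and extract over standoff annotations (section 6)

Because annotations are recorded as character offsets over a shared hub string, the merge and extract operations of section 6 reduce, in the simplest reading, to a union and a filter over sets of markings. The sketch below assumes exactly that reading. The names Mark, Document, merge and extract are hypothetical, the hub sentence and offsets are made up for illustration, and the creation of the corresponding new node in the hierarchy is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    tag: str
    start: int         # offset of the first character in the hub
    end: int           # offset one past the last character
    attrs: tuple = ()  # (attribute name, value) pairs


@dataclass
class Document:
    hub: str
    marks: frozenset


def merge(a, b):
    """Union of the markings of two documents over the same hub."""
    if a.hub != b.hub:
        raise ValueError("merge requires identical hub documents")
    return Document(a.hub, a.marks | b.marks)


def extract(doc, allowed):
    """Keep only the tags and attributes of an ancestor standard.

    allowed maps each tag name of that standard to its set of attribute names.
    """
    kept = set()
    for m in doc.marks:
        if m.tag in allowed:
            attrs = tuple((k, v) for k, v in m.attrs if k in allowed[m.tag])
            kept.add(Mark(m.tag, m.start, m.end, attrs))
    return Document(doc.hub, frozenset(kept))


HUB = "She was dreaming of his mother."

np_doc = Document(HUB, frozenset({Mark("NP", 20, 30, (("id", "n1"),))}))
seg_doc = Document(HUB, frozenset({Mark("SEG", 0, 31, (("id", "s1"),))}))

merged = merge(np_doc, seg_doc)
np_only = extract(merged, {"NP": {"id"}})

for m in sorted(merged.marks, key=lambda m: (m.start, -m.end)):
    print(m.tag, repr(HUB[m.start:m.end]))
```

Keeping the hub string untouched and comparing it before merging reflects the role of the hub as the absolute system of reference; overlapping or crossing markings pose no problem in this representation, since nothing is embedded in the text.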